Computational inference of grammars for larger-than-gene structures from annotated gene sequences

Authors

  • Guy Tsafnat
  • Jaron Schaeffer
  • Andrew Clayphan
  • Jonathan R. Iredell
  • Sally R. Partridge
  • Enrico W. Coiera
Abstract

MOTIVATION: Larger-than-gene structures (LGS) are DNA segments that include at least one gene and often other segments such as inverted repeats and gene promoters. Mobile genetic elements (MGE) such as integrons are LGS that play an important role in horizontal gene transfer, primarily in Gram-negative organisms. Because of the number of genes involved, known LGS have a profound effect on virulence, antibiotic resistance and other properties of the organism. Expert-compiled grammars have been shown to be an effective computational representation of LGS, well suited to automating annotation and supporting de novo gene discovery. However, development of LGS grammars by experts is labour intensive and restricted to known LGS.

OBJECTIVES: This study uses computational grammar inference methods to automate LGS discovery. We compare the ability of six algorithms to infer LGS grammars from DNA sequences annotated with genes and other short sequences. We compare the predictive power of the learned grammars against an expert-developed grammar for gene cassette arrays found in Class 1, 2 and 3 integrons, which are modular LGS containing up to 9 of about 240 cassette types.

RESULTS: Using a Bayesian generalization algorithm, our inferred grammar was able to predict >95% of MGE structures in a corpus of 1760 sequences obtained from GenBank (F-score 75%). Even with 100% noise added to the training and test sets, we obtained an F-score of 68%, indicating that the method is robust and has the potential to predict de novo LGS structures when the underlying gene features are known.

AVAILABILITY: http://www2.chi.unsw.edu.au/attacca.
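The abstract reports its results as F-scores but does not say how predictions were matched to known structures. As a minimal sketch, assuming each structure is identified by a hypothetical `(sequence_id, start, end)` tuple and that the reported figure is the balanced F1 variant, the metric could be computed as follows. The function name and the example data are illustrative only and are not taken from the paper.

```python
def f_score(predicted, actual, beta=1.0):
    """Return the F-beta score of predicted structures against known ones.

    Both arguments are iterables of hashable structure identifiers.
    Exact-match counting is an assumption; the paper may score differently.
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical toy data: (sequence id, start feature index, end feature index)
predicted = {("seq1", 0, 3), ("seq2", 1, 4), ("seq3", 0, 2)}
actual = {("seq1", 0, 3), ("seq2", 1, 4), ("seq4", 2, 5), ("seq5", 0, 1)}

score = f_score(predicted, actual)
print(f"F1 = {score:.3f}")
```

With two of three predictions correct (precision 2/3) and two of four known structures recovered (recall 1/2), the balanced score is 4/7. This shows why a method can recover most structures yet report a lower F-score when it also predicts many structures that are not in the reference set.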

Related articles

An Evolutionary and Phylogenetic Study of the BMP15 Gene

DNA sequence data contains a wealth of biologically useful information. Recent innovations in DNA sequencing technology have greatly increased our capacity to determine massive amounts of nucleotide sequences. These sequences can be used to specify the characteristics of different regions, interpret the evolutionary relationships between categorized groups, likelihood of performing multiple com...

P-215: Discovery of A Novel APA Variant of A Human Potential Gene Based on Expressed Sequenced Tags Analysis

Background: Expressed sequence tags (ESTs) are sequences of cDNA fragments prepared from different tissue sources. There are over one million of these sequences in the publicly available database, and these sequences are believed to represent more than half of all human genes. The ESTs belong to different cDNA libraries, each prepared from one particular cell type, organ, or tumor. Therefore, th...

In silico screening of G-Quadruplex Structures in Wilms tumor 1 Gene Promoter

Introduction: X-ray diffraction studies have revealed that guanines in DNA strands may be arranged in quartets and form a structure called a G-quadruplex. Bioinformatics studies have suggested the formation of G-quadruplex structures in crucial human genes, including Wilms tumor 1 (WT1). The aim of this study was to analyze in silico the guanine-rich sequence in the promoter region of the WT1 gene...

Clustering of a Number of Genes Affecting in Milk Production using Information Theory and Mutual Information

Information theory is a branch of mathematics that is used in genetic and bioinformatics analyses and can be applied to many analyses of biological structures and sequences. Bio-computational grouping of genes facilitates genetic analysis, sequencing and structure-based analyses. In this study, after retrieving gene and exon DNA sequences affecting milk yield in dairy ...

Evaluation of First and Second Markov Chains Sensitivity and Specificity as Statistical Approach for Prediction of Sequences of Genes in Virus Double Strand DNA Genomes

The growing amount of information on biological sequences has made the application of statistical approaches necessary for modeling and estimating their functions. In this paper, the sensitivity and specificity of the first and second Markov chains for prediction of genes were evaluated using the complete double stranded DNA virus. There were two approaches for prediction of each Markov Model parameter,...

Journal:
  • Bioinformatics

Volume 27, Issue 6

Pages: -

Publication date: 2011